According to the model selection, the best model structure is y ~ x1 + x2 + x4 + error, i.e. the powers x, x^2 and x^4 of the input. We now test this model further.
pacman::p_load(ggplot2, ggthemes, tidyr, patchwork, plotly, ggforce)
data = read.csv("data/x_y.csv", header = F)
colnames(data) = c("x", "y")
Construct X matrix
X = cbind(data$x,
data$x^2,
data$x^4
)
Split into train-test
train_X = X[1:200,]
train_y = as.matrix(data$y[1:200])
test_X = X[201:250,]
test_y = as.matrix(data$y[201:250])
Fit model, generate predictions, compute error (SSE, MSE) and adjusted R^2.
theta = solve(crossprod(train_X), crossprod(train_X, train_y))
y_pred = train_X %*% theta
residuals = (train_y - y_pred)
SSE = sum(residuals^2)
sigma_sq = SSE/(nrow(train_X) - 1)
## MSE is: 0.01069316
## Adjusted R^2 is: 0.9999891
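The MSE and adjusted R^2 reported above can be reproduced with a small helper. This is a sketch: the divisor choices (n for MSE, n - p - 1 for adjusted R^2, with p = 3 predictors here) and the name `error_metrics` are assumptions, not taken from the source.

```r
# Sketch: MSE and adjusted R^2 from observed and predicted values.
# Assumption: MSE uses divisor n; adjusted R^2 uses n - p - 1.
error_metrics = function(y, y_pred, p) {
  n = length(y)
  SSE = sum((y - y_pred)^2)            # residual sum of squares
  SST = sum((y - mean(y))^2)           # total sum of squares
  MSE = SSE / n
  adj_R_sq = 1 - (1 - SSE / SST) * (n - 1) / (n - p - 1)
  list(MSE = MSE, adj_R_sq = adj_R_sq)
}
```

Applied to `train_y` and `y_pred` with `p = ncol(train_X)`, this yields the values printed above.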
Numerical analysis
## Residual analysis:
##          Min          25%      Median         75%         Max
## -0.305319926 -0.061646247 0.008535535 0.077296305 0.291481680
## [,1] [,2] [,3]
## [1,] 5.119680e-05 9.940021e-06 -1.779249e-06
## [2,] 9.940021e-06 4.546752e-05 -5.317726e-06
## [3,] -1.779249e-06 -5.317726e-06 8.751376e-07
Because we have 3 parameters, we plot all pairwise combinations, resulting in 3 plots.
First we create a grid of possible parameter values over which we evaluate the uncertainty.
n_points = 50
half_width = 0.01  # half-width of the grid around each estimate (avoid masking base::range)
x1_grid = seq(theta[1] - half_width, theta[1] + half_width, length = n_points)
x2_grid = seq(theta[2] - half_width, theta[2] + half_width, length = n_points)
x4_grid = seq(theta[3] - half_width, theta[3] + half_width, length = n_points)
# upper corner of the grid; avoid naming it `t`, which masks base::t()
theta_max = rbind(x1_grid[50], x2_grid[50], x4_grid[50])
Now we can calculate the multivariate normal pdf of the parameter estimates over the grid.
cov_inv = crossprod(train_X) / sigma_sq  # inverse covariance of the estimates
cov_det = 1 / det(cov_inv)               # det(Sigma) = 1/det(Sigma^-1); `det(cov)` would call base::cov
n_params = 3
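A minimal sketch of evaluating that pdf at one grid point, assuming the density in question is the multivariate normal N(theta, Sigma) of the estimates; `mvn_pdf` is an illustrative name, not from the source.

```r
# Sketch: multivariate normal density at `point`, centred at `center`,
# parameterised by the inverse covariance and the determinant of the
# covariance (not of its inverse), with k the number of parameters.
mvn_pdf = function(point, center, cov_inv, cov_det, k) {
  d = point - center
  as.numeric(exp(-0.5 * t(d) %*% cov_inv %*% d) / sqrt((2 * pi)^k * cov_det))
}
```

Looping `mvn_pdf` over pairs of grid values (holding the third parameter at its estimate) gives the surfaces for the 3 pairwise plots.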
Now we’ll generate predictions on the training data along with 95% confidence intervals.
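A sketch of the confidence-interval step, assuming the usual OLS standard error of the fitted mean, se_i = sqrt(sigma_sq * x_i' (X'X)^{-1} x_i), and a normal quantile; `pred_ci` is an illustrative helper, not from the source.

```r
# Sketch: pointwise confidence intervals for the fitted mean under OLS.
pred_ci = function(X, theta, sigma_sq, level = 0.95) {
  XtX_inv = solve(crossprod(X))
  # rowSums((X %*% XtX_inv) * X) is diag(X XtX_inv X') without the full matrix
  se = sqrt(sigma_sq * rowSums((X %*% XtX_inv) * X))
  z = qnorm(1 - (1 - level) / 2)       # ~1.96 for a 95% interval
  y_hat = as.vector(X %*% theta)
  data.frame(y_pred = y_hat, lower = y_hat - z * se, upper = y_hat + z * se)
}
```

`pred_ci(train_X, theta, sigma_sq)` then supplies the ribbon bounds for the plot below.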
Plot the predictions + CI
(y_y_pred = ggplot(pred_data, aes(x = y, y = y_pred))+
geom_point()+
theme_few())
ggsave("figures/08c_true_pred.png", y_y_pred, width = 7, height = 4)
Validate the model on a new train-test split (70:30).
Construct X matrices (train and test)
Re-use the fitting function from model selection
Fit model on training data and predict testing data
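The split-and-refit step can be sketched as below; the random sampling, the `split_fit` name, and treating `y` as a plain vector are assumptions, not the source's exact fitting function.

```r
# Sketch: one random 70:30 split, refit by the normal equations, MSE on both parts.
split_fit = function(X, y, train_frac = 0.7) {
  n = nrow(X)
  idx = sample(n, floor(train_frac * n))             # random training rows
  Xtr = X[idx, , drop = FALSE]
  theta = solve(crossprod(Xtr), crossprod(Xtr, y[idx]))
  mse = function(rows) mean((y[rows] - X[rows, , drop = FALSE] %*% theta)^2)
  c(MSE_train = mse(idx), MSE_test = mse(setdiff(seq_len(n), idx)))
}
```

Repeating `split_fit(X, data$y)` over many seeds produces the `result` data frame whose train/test MSEs are plotted below.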
## MSE on training data = 0.01094003
## MSE on testing data = 0.008348671
ggplot(result, aes(MSE_train, MSE_test))+
geom_line()